A Bloom Filter is a probabilistic data structure used to test whether an element is a member of a set. It is designed to be very fast and extremely space-efficient, especially when working with large volumes of data where memory is constrained.
However, it comes with one trade-off:
In this chapter, we will explore the low-level design of bloom filter in detail.
Lets start by clarifying the requirements:
Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a discussion between the candidate and the interviewer might unfold:
Candidate: Should the Bloom Filter support generic types (e.g., any object), or should we limit it to specific data types like strings or integers
Interviewer: Let’s assume for this design that we’re working with strings.
Candidate: What operations should the Bloom Filter support? Just add and mightContain, or do we also need to support deletion
Interviewer: Just add and mightContain. Deletion is not required in a standard Bloom Filter.
Candidate: What kind of false result is acceptable? Are we okay with false positives?
Interviewer: Yes, that’s expected. A Bloom Filter can return false positives, but it should never return false negatives.
Candidate: Should the number of hash functions (k) and size of the bit array (m) be configurable or fixed?
Interviewer: Ideally, they should be configurable at initialization. The values may vary depending on the expected number of elements and acceptable false positive rate.
Candidate: What should be the behavior if the same element is added multiple times?
Interviewer: That’s fine. Bloom Filters are idempotent. Adding the same element again should have no effect on correctness.
Candidate: Should the design support concurrency? For example, multiple threads calling add and mightContain?
Interviewer: Yes. Assume the Bloom Filter will be accessed from a multi-threaded environment.
Based on the discussion, here’s a summary of the functional requirements:
k (number of hash functions) and m (size of the bit array).add and mightContain operations should be optimized for speed, ideally executing in O(1) time.The design of a Bloom Filter is fundamentally different from typical object-oriented systems that model real-world entities. Instead of focusing on domain objects, we focus on choosing the right low-level data structures and abstractions that enable fast, memory-efficient, and thread-safe operations.
Our goal is to support two operations add(element) and mightContain(element) both in O(1) time, using minimal memory, while being safe for use in multi-threaded environments.
To meet these requirements, we must combine three key components:
Let’s break down these components and understand how they interact.
The first and most fundamental requirement is space efficiency. Since we are explicitly told not to store the actual elements, we need a compact representation of which elements have been seen.
This leads us to a bit array — a one-dimensional array where each element is a single bit (0 or 1).
k bit positions, and each of those bits is set to 1.k bits are examined. If all are 1, the element might be in the set. If any bit is 0, it is definitely not.In Java, we would typically use java.util.BitSet for this purpose, as it provides a compact, memory-efficient implementation and utilities for bit manipulation.
To populate and check the bit array, we need a way to map each element to one or more bit positions. This is the role of our hash functions.
A standard Bloom Filter uses k independent hash functions, each of which maps a string input to a bit array index between 0 and m - 1.
add() and mightContain() operations.In our Bloom Filter, we will maintain an array or list of k HashFunction implementations, each producing a different hash value for the same input.
Finally, we need a public-facing class that ties everything together and exposes a clean API to the user. This is the BloomFilter class.
It acts as the coordinator or facade, hiding the internal complexity of hashing and bit manipulation while providing two easy-to-use methods:
add(String element)mightContain(String element): booleanm bits representing the filter’s state. Supports fast setting and checking of bits.k independent hash functions used to map string elements to indices in the bit array.add() and mightContain() and manages internal data structures.Together, these core components allow us to build a Bloom Filter that is compact, and fast, all while delivering the required functionality and performance.
Now that we've identified the core components of a Bloom Filter, it's time to convert those ideas into a structured class design. This includes defining the responsibilities and attributes of each class, establishing relationships between them, applying design patterns for clean extensibility, and finally visualizing the system in a class diagram.
We begin by defining the key classes involved in the system, starting with data structures and utilities, and culminating in the main interface.
BloomFilterThis is the main class that exposes the public API (add and mightContain) and orchestrates the behavior of the internal components.
bitArray: BitSet – The core data store that tracks presence via bits.hashFunctions: List<HashFunction> – A list of k independent hash functions.size: int – The total size m of the bit array.BloomFilter(int size, List<HashFunction> hashFunctions) – Constructor to initialize size and hash functions.void add(String element) – Applies all hash functions to the input and sets the corresponding bits.boolean mightContain(String element) – Applies all hash functions and checks if all relevant bits are set.HashFunction (Interface)Defines the contract for hash functions used by the Bloom Filter.
Method:
int hash(String input, int seed, int bound) – Computes a hash of the input string, using a seed, and returns a result in the range [0, bound).This allows support for multiple independent hash functions by using different seeds.
Understanding how these classes interact clarifies system behavior and responsibilities.
BloomFilter has-a BitSetBloomFilter has-a List<HashFunction>The BloomFilter owns and manages these components. Their lifecycle is tied to the lifecycle of the filter.
BloomFilter uses HashFunction to compute bit positions.HashFunction is an abstraction that allows the filter to be configured with different implementations.We apply a few classic design principles and patterns to keep the system modular, flexible, and testable.
The HashFunction interface and its implementations follow the Strategy Pattern. This allows us to inject different hashing strategies into the filter at runtime, based on the use case or performance characteristics.
DefaultHashFunction with MurmurHashFunction , Djb2Hashwithout changing the BloomFilter class.Creating a Bloom Filter requires complex calculations to determine the optimal bit array size (m) and number of hash functions (k) based on user-friendly inputs (expected elements and desired false positive rate). We provide a public static factory method BloomFilter.create(...) that encapsulates this complexity, offering a much cleaner API to the client than a constructor that requires m and k directly.
The BloomFilter class acts as a facade, exposing a simplified interface (add and mightContain) while hiding all internal details related to hashing, bit manipulation, and concurrency handling.
This simplifies the usage for clients and makes the internal system easier to evolve without breaking external code.
HashType Enum1class HashType(Enum):
2 FNV1A = "FNV1A"
3 DJB2 = "DJB2"Defines the set of supported hash function types. This allows runtime flexibility in choosing hash algorithms and enables extensibility by adding new types later.
HashStrategy Interface and ImplementationsDefines a contract for all hashing algorithms used in the Bloom Filter. Any strategy must implement a consistent way to hash strings into long integers.
1class HashStrategy(ABC):
2 @abstractmethod
3 def hash(self, data: str) -> int:
4 pass
5
6class FNV1aHashStrategy(HashStrategy):
7 # FNV-1a 64-bit constants
8 FNV_PRIME = 0x100000001b3
9 FNV_OFFSET_BASIS = 0xcbf29ce484222325
10
11 def hash(self, data: str) -> int:
12 hash_value = self.FNV_OFFSET_BASIS
13 for byte in data.encode('utf-8'):
14 hash_value ^= byte
15 hash_value *= self.FNV_PRIME
16 return hash_value
17
18class DJB2HashStrategy(HashStrategy):
19 def hash(self, data: str) -> int:
20 hash_value = 5381
21 for byte in data.encode('utf-8'):
22 # hash = hash * 33 + c
23 hash_value = ((hash_value << 5) + hash_value) + byte
24 return hash_valueHashStrategyFactory1class HashStrategyFactory:
2 @staticmethod
3 def create(hash_type: HashType) -> HashStrategy:
4 if hash_type == HashType.FNV1A:
5 return FNV1aHashStrategy()
6 elif hash_type == HashType.DJB2:
7 return DJB2HashStrategy()
8 else:
9 raise ValueError(f"Unsupported hash type: {hash_type}")Implements the Factory Pattern to provide the appropriate hashing strategy based on the enum value. Decouples strategy instantiation from client code.
BloomFilter ClassThis is the core Bloom Filter class.
1class BloomFilter:
2 def __init__(self, bit_set_size: int, num_hash_functions: int, strategies: list[HashStrategy]):
3 self.bit_set_size = bit_set_size
4 self.num_hash_functions = num_hash_functions
5 self.bit_set = bitarray(bit_set_size)
6 self.bit_set.setall(0)
7 self.hash_strategies = strategies
8
9 def add(self, item: str):
10 for i in range(self.num_hash_functions):
11 hash_value = self.hash_strategies[i].hash(item)
12 index = abs(hash_value) % self.bit_set_size
13 self.bit_set[index] = 1
14
15 def might_contain(self, item: str) -> bool:
16 for i in range(self.num_hash_functions):
17 hash_value = self.hash_strategies[i].hash(item)
18 index = abs(hash_value) % self.bit_set_size
19 if not self.bit_set[index]:
20 return False # Definitely not in the set
21 return True # Might be in the set
22
23 class Builder:
24 def __init__(self):
25 self.bit_set_size = 0
26 self.num_hash_functions = 0
27 self.strategies = None
28
29 def with_bit_set_size(self, bit_set_size: int):
30 if bit_set_size <= 0:
31 raise ValueError("Bit set size must be positive.")
32 self.bit_set_size = bit_set_size
33 return self
34
35 def with_num_hash_functions(self, num_hash_functions: int):
36 if num_hash_functions <= 0:
37 raise ValueError("Number of hash functions must be positive.")
38 self.num_hash_functions = num_hash_functions
39 return self
40
41 def with_hash_strategies(self, strategies: list[HashStrategy]):
42 if strategies is None or len(strategies) == 0:
43 raise ValueError("At least one hash strategy must be provided.")
44 self.strategies = strategies
45 return self
46
47 def build(self) -> 'BloomFilter':
48 if self.bit_set_size == 0 or self.num_hash_functions == 0 or self.strategies is None:
49 raise ValueError("Must set bit set size, number of hash functions, and strategies.")
50
51 if len(self.strategies) < self.num_hash_functions:
52 raise ValueError(
53 f"The number of provided hash strategies ({len(self.strategies)}) "
54 f"must be at least equal to the number of hash functions required ({self.num_hash_functions})."
55 )
56
57 print(f"Creating Bloom Filter with specified parameters:")
58 print(f" - Bit set size (m): {self.bit_set_size}")
59 print(f" - Hash functions (k): {self.num_hash_functions}")
60
61 return BloomFilter(self.bit_set_size, self.num_hash_functions, self.strategies)bitSet: The binary array (bitmap) representing membership.numHashFunctions: Controls how many different hashes are computed for each item.hashStrategies: A list of HashStrategy instances used to simulate multiple independent hash functions.add(String item)Inserts an element into the Bloom filter by setting k bits in the bit array. Each hash function produces a position; that bit is set to 1.
mightContain(String item)Checks for probable presence by verifying all bits that would have been set for the element.If any bit is not set, the element is definitely not present (no false negatives). If all bits are set, the item might be present (possible false positives).
Constructing a BloomFilter requires several parameters that must be set correctly. The Builder pattern provides a fluent, readable API for this configuration. It also allows us to perform validation within the build() method, ensuring that no BloomFilter object is ever created in an invalid state.
BloomFilterDemoThe BloomFilterDemo class demonstrates how a client would use the system, highlighting its core properties: the guarantee of no false negatives and the probability of false positives.
1class BloomFilterDemo:
2 @staticmethod
3 def main():
4 # --- 1. Manually define parameters ---
5 bit_set_size = 10000
6 num_hash_functions = 2
7 expected_insertions = 1000
8
9 # --- 2. Create a list of hash strategies at runtime ---
10 # We use the Factory to get base strategies and the Decorator to create unique variations.
11 strategies = [
12 HashStrategyFactory.create(HashType.FNV1A),
13 HashStrategyFactory.create(HashType.DJB2)
14 ]
15
16 # --- 3. Build the filter using the new Builder syntax ---
17 filter = BloomFilter.Builder() \
18 .with_bit_set_size(bit_set_size) \
19 .with_num_hash_functions(num_hash_functions) \
20 .with_hash_strategies(strategies) \
21 .build()
22
23 # --- 4. Add elements to the filter ---
24 print("\n--- Adding elements to the filter ---")
25 inserted_elements = []
26 for i in range(expected_insertions):
27 element = f"user{i}@example.com"
28 inserted_elements.append(element)
29 filter.add(element)
30 print(f"{expected_insertions} elements have been added.")
31
32 # --- 5. Test for presence (no false negatives) ---
33 print("\n--- Verifying no false negatives ---")
34 has_false_negatives = False
35 for element in inserted_elements:
36 if not filter.might_contain(element):
37 print(f"FALSE NEGATIVE DETECTED FOR: {element}", file=sys.stderr)
38 has_false_negatives = True
39 break
40 if not has_false_negatives:
41 print("Success! No false negatives found. All inserted elements were detected.")
42
43 # --- 6. Test for false positives ---
44 print("\n--- Testing for false positives ---")
45 test_set_size = 10000
46 false_positives_count = 0
47 for i in range(test_set_size):
48 random_element = str(uuid.uuid4())
49 if filter.might_contain(random_element):
50 false_positives_count += 1
51 print(f"Number of false positives found: {false_positives_count} out of {test_set_size} random items.")
52
53if __name__ == "__main__":
54 BloomFilterDemo.main()This driver code demonstrates a complete workflow:
What is the primary reason for using a bit array in the design of a Bloom Filter?
For the question "Which class relationship best describes the connection between the Bloom Filter and its hash function(s)?"
I think the answer would not be Aggregation, it should be just "Composition"